{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "We've already seen how to implement a linear regression in which a single variable predicts the value of another, related variable. To predict a variable from more than one input variable, we need to express the model with matrices.\n", "\n", "In this notebook we'll implement a multivariate linear regression. We only cover continuous covariates here, but the method works identically for categorical covariates; they just require some extra preprocessing before the model is fitted." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Generate data for multivariate regression" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 6.00804119 7.32002909 -9.62139603 4.97830887 8.92353483]\n" ] } ], "source": [ "n = 1000 #Number of observations in the training set\n", "p = 5 #Number of parameters, including intercept\n", "\n", "#Assign true parameters to be estimated\n", "beta = np.random.uniform(-10, 10, p) #Randomly initialise true parameters\n", "print(beta)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = np.random.uniform(0,10,(n,(p-1))) #Draw the p-1 non-intercept features uniformly from [0, 10)\n", "X0 = np.ones((n,1)) #Column of ones for the intercept\n", "\n", "X = np.concatenate([X0,X], axis = 1) #Join intercept column to the other variables to form the feature matrix\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "Y = np.matmul(X,beta) + np.random.normal(0,10,n) #Linear combination of the features plus a normal error term (mean 0, standard deviation 10)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": 
{}, "outputs": [], "source": [ "#Concatenate to create dataframe\n", "\n", "dataFeatures = pd.DataFrame(X)\n", "dataFeatures.columns = [f'X{i}' for i in range(p)]\n", "\n", "dataTarget = pd.DataFrame(Y)\n", "dataTarget.columns = ['Y']\n", "\n", "data = pd.concat([dataFeatures, dataTarget], axis = 1)\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | X0 | X1 | X2 | X3 | X4 | Y |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | 1.0 | 2.155360 | 9.921219 | 0.044325 | 0.441540 | -75.531362 |\n",
"| 1 | 1.0 | 8.666179 | 7.390977 | 2.859317 | 0.218063 | 29.703580 |\n",
"| 2 | 1.0 | 3.618614 | 6.246515 | 0.075729 | 8.564948 | 69.347375 |\n",
"| 3 | 1.0 | 0.425044 | 5.063864 | 0.974687 | 2.666221 | -2.618186 |\n",
"| 4 | 1.0 | 4.601384 | 2.430928 | 1.119196 | 0.165348 | 42.026702 |\n",
"